Diagnosing Solid Lesions in the Pancreas With Multimodal Artificial Intelligence

Key Points
Question: Can a multimodal artificial intelligence (AI) model facilitate a clinical diagnosis of solid lesions in the pancreas?
Findings: In this randomized crossover trial based on a prospective dataset of 130 patients who underwent endoscopic ultrasonographic (EUS) procedures, a multimodal AI model incorporating EUS images and clinical data demonstrated robustness across internal and external cohorts (the area under the curve of the joint-AI model was 0.996 in the internal test dataset and 0.955, 0.924, and 0.976 in the 3 external test datasets, respectively). In addition, the performance of novice endoscopists was significantly enhanced with AI assistance.
Meaning: This study suggests that endoscopists of varying expertise can efficiently cooperate with this multimodal AI model, establishing a proof of concept for human-AI interaction in the management of solid lesions in the pancreas.

produce the probability (a continuous variable from zero to one) of the input image. The final output of Model-1 was binary, being either carcinoma (CA) or non-cancerous (non-CA). To prevent overfitting, three methods were adopted: data augmentation, dropout, and early stopping.

2.1. Firstly, data augmentation was utilized. The augmentation techniques included random horizontal and vertical translations (range ratios of 20% and 30%, respectively), shear transformation (stretching images along the horizontal or vertical direction by up to 20 degrees), rotation (up to 10 degrees), zoom (maximum range of 30%), constant filling of empty regions after transformations, and random vertical and horizontal flipping.

2.2. Secondly, a dropout layer was applied to the GAP layer with a rate of 0.50.

2.3. Thirdly, early stopping was implemented to halt training when the validation loss failed to decrease for ten consecutive epochs. Model-1 was configured to train for a maximum of 200 epochs; the actual number of training epochs was 64 because of the early stopping criterion.

2.4. Grid search was used to find the optimal hyperparameters of the model. Specifically, the learning rate was chosen among 0.1, 0.01, and 0.001; the batch size among 16, 32, and 64; and the dropout rate among 0.2, 0.3, 0.4, and 0.5. After the model's performance had been evaluated for each combination of hyperparameters, the learning rate was set to 0.001, the batch size to 16, and the dropout rate to 0.5.

2.5. The model was trained by the Adam optimizer with an initial learning rate of 0.001, a batch size of 16, weight decay of 0.000001, and momentum of 0.90.
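The grid search and the early stopping rule described for Model-1 can be illustrated with a short sketch in plain Python. This is not the authors' training code. The function names `grid` and `epochs_until_stop`, and the idea of passing in a precomputed list of validation losses, are assumptions made for illustration; only the candidate hyperparameter values, the patience of ten epochs, and the 200-epoch cap are taken from the text.

```python
from itertools import product

# Candidate values and stopping settings as stated in the text.
LEARNING_RATES = (0.1, 0.01, 0.001)
BATCH_SIZES = (16, 32, 64)
DROPOUT_RATES = (0.2, 0.3, 0.4, 0.5)
PATIENCE = 10      # epochs without improvement before halting
MAX_EPOCHS = 200   # upper bound on training length


def grid():
    """Yield every hyperparameter combination in the search space."""
    for lr, bs, dr in product(LEARNING_RATES, BATCH_SIZES, DROPOUT_RATES):
        yield {"learning_rate": lr, "batch_size": bs, "dropout": dr}


def epochs_until_stop(val_losses, patience=PATIENCE, max_epochs=MAX_EPOCHS):
    """Return the number of epochs run before early stopping halts training.

    `val_losses` is a hypothetical list of per-epoch validation losses;
    in real training each value would be computed after the epoch ends.
    """
    best = float("inf")
    stale = 0        # consecutive epochs without a lower validation loss
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run
```

Under these assumptions the search space contains 3 × 3 × 4 = 36 combinations, each of which would be trained and scored once. The stopping rule counts only strict decreases in validation loss as improvement, which is one reading of "failed to decrease".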

Model-2 (Selection of significant clinical features).
Feature selection was conducted to lower the risk of overfitting and reduce the computational burden.

3.1. The training dataset used for the ML models was the same as that of Model-1, consisting of the clinical data collected from 351 patients at our center (WHTJH).

3.2. Firstly, a total of 36 clinical features (sex, age, BMI, history of smoking, history of alcohol consumption, abdominal pain, weight loss, jaundice, diarrhea, vomiting, back pain, symptoms of hypoglycemia, weight gain, new-onset diabetes within 2 years, tumor history in other systems, chronic pancreatitis, long-term diabetes, hepatitis B virus, hypertension, metformin, sulfonylureas, thiazolidinediones, insulin, direct bilirubin, CA19-9, CEA, amylase, lipase, appearance of the lesion including CT attenuation in the pancreatic parenchymal phase, MRI T1-weighted signal, MRI T2-weighted signal, DWI, presence of pancreatic duct dilation, presence of common bile duct dilation, presence of pancreatic enlargement, presence of pancreatic parenchymal atrophy) were categorized into five groups according to their nature: personal history, medical history, clinical symptoms, laboratory tests, and radiology findings.

3.3. Next, features from the same category were arranged into various combinations to train several machine learning (ML) models. Given the differences in data types and characteristics among the five categories of clinical features, it was reasonable to expect that the optimal ML algorithm for capturing the relevant patterns and relationships within each category might differ. Therefore, multiple ML algorithms were employed during the training process, including Gaussian naive Bayes (GNB), k-nearest neighbors (KNN), logistic regression (LR), random forest (RF), decision tree (DT), support vector machine (SVM), and gradient boosting decision tree (GBDT). The probabilities of the patients having CA were produced from the input clinical features, and the final output of the ML models was binary (CA or non-CA).

3.4. The optimal combination of ML algorithm and features for each category was determined based on the diagnostic accuracy evaluated on the clinical data of the 88 patients in the internal test dataset.
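The selection procedure for Model-2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The text says only that features within a category were "arranged into various combinations", so the exhaustive enumeration in `feature_subsets` is an assumption, and all function names are hypothetical. The sketch presumes that each (category, algorithm, feature subset) candidate has already been trained and scored for accuracy on the internal test dataset.

```python
from itertools import combinations


def feature_subsets(features):
    """Yield every non-empty combination of the features in one category."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)


def accuracy(y_true, y_pred):
    """Fraction of binary predictions (CA = 1, non-CA = 0) that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best(results):
    """Pick the most accurate algorithm/feature pairing within each category.

    `results` is an iterable of (category, algorithm, features, accuracy)
    tuples, one per trained candidate model.
    Returns {category: (algorithm, features, accuracy)}.
    """
    best = {}
    for category, algorithm, features, acc in results:
        # Keep the first candidate seen on ties (strict comparison).
        if category not in best or acc > best[category][2]:
            best[category] = (algorithm, features, acc)
    return best
```

A category with n features yields 2^n − 1 non-empty subsets, so exhaustive enumeration is practical only for the small per-category feature counts considered here. Selecting on the internal test dataset follows the procedure as described in the text.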

Model-3 (the joint-AI model).
4.1. The inputs to Model-3 included outputs from the linear layers of Model-1 and the selected clinical features from Model-2. Because of the multilayer nature of the CNN, the features extracted by layers closer to the output are more abstract.⁴ Therefore, three fusion strategies were used to generate the input vector for Model-3.

4.3. Similarly, grid search was used to find the optimal hyperparameters of the model. The learning rate was set to 0.001, the batch size to 16, and the dropout rate to 0.5.

4.4. This model was optimized by Adam with a learning rate of 0.001,⁵ and binary cross-entropy was used as the loss function. To prevent overfitting, early stopping was implemented to halt training if the validation loss failed to decrease for ten consecutive epochs. Model-3 was configured to train for a maximum of 40 epochs; the actual number of training epochs was 24 because of the early stopping criterion.

4.5. The diagnostic efficacy of the models built using the three fusion strategies was evaluated. The model with the best performance was further evaluated in a prospective dataset.

5. Interpretability analysis.
Firstly, gradient-weighted class activation mapping (Grad-CAM) was applied to Model-1.⁶ The heatmap generated by Grad-CAM indicated the regions within the EUS images that significantly influenced the predictions. In addition, Shapley additive explanations (SHAP) was implemented to analyze the output of Model-3.⁷ This approach provided both local explanations, tailored to specific patients, and global explanations considering all instances of the model. Through SHAP, the contribution of each individual element to the prediction was quantified.

Endoscopists were required to finish the questionnaire at the end of the study. The "joint-CNN" was the previous name of the "joint-AI" model. To avoid potential confusion, we changed the name to the "joint-AI" model when drafting this paper. The sample is the translated version, as the original was written in Chinese.

Representative EUS images and their corresponding Grad-CAM heatmaps. The heatmaps display the model's focused area within the EUS images. The upper pair presents a carcinoma lesion (A), while the lower pair exhibits a benign lesion resulting from chronic pancreatitis (B). The presence of a heated area in the Grad-CAM heatmap for chronic pancreatitis can be attributed to image features shared with pancreatic cancer. Despite these shared features, the model's predicted probability for the image of chronic pancreatitis does not exceed the diagnostic threshold for carcinoma, leading to a negative prediction. Grad-CAM, gradient-weighted class activation mapping.

eFigure 3. ROC Analyses of Different Feature Fusion Strategies
The models developed by strategies A, B, and C were compared on the internal testing dataset in the image and patient phases. Strategy C, which directly fuses predictions based on entire images with the clinical features of patients, could only be evaluated in the patient phase. The strategy with the best performance (strategy B) was selected to develop the final joint-AI model. ROC, receiver operating characteristic; AUC, area under the curve.

eFigure 4. eFigure 5.
AI Models' Performance in Differentiating Carcinoma and Noncancerous Lesions in the Patient Phase
The performance of the AI models in the patient phase. Model-3 was developed based on both clinical information and EUS images, whereas Model-1 was trained on EUS images only. The internal testing dataset was collected from WHTJH, Wuhan Tongji Hospital. Three external testing datasets were involved: NJDTH, Nanjing Drum Tower Hospital; PUMCH, Peking Union Medical College Hospital; BJFH, Beijing Friendship Hospital. ROC, receiver operating characteristic; AUC, area under the curve.

The Grad-CAM Analysis
Performance of Model-1 in Internal and External Datasets
Selection of Significant Clinical Features From Individual Categories
Performance of Different Fusion Strategies in the Image Phase and Patient Phase
Performance of Individual Endoscopists on the Prospective Dataset
Performance of Model-1 and Endoscopists Without AI-Assistance on the Prospective Dataset
Performance of Model-3 and Endoscopists Without AI-Assistance on the Prospective Dataset
The Rate of Endoscopists Rejecting the AI-Assistance
a. Total rejection rate: (number of cases endoscopists disagree with prediction of the joint-AI) / (total number of cases)
b. False rejection rate: (number of cases endoscopists falsely reject the prediction of the joint-AI) / (total number of cases endoscopists disagree with the prediction of the joint-AI)
c. The odds ratio was calculated from the total rejection rates of novices vs expert and senior endoscopists, with or without the interpretability analysis.

eTable 9. Comparison of the Impact Between EUS-CNN and Joint-AI on the Decision-Making of Endoscopists ᵃ
Questionnaire for Endoscopists on the Usage of the AI Models⁸,⁹

eTable 1. Patient Demographics and Baseline Characteristics
WHTJH, Wuhan Tongji Hospital; NJDTH, Nanjing Drum Tower Hospital; PUMCH, Peking Union Medical College Hospital; BJFH, Beijing Friendship Hospital
© 2024 Cui H et al. JAMA Network Open.

eTable 3.
RF: random forest, DT: decision tree, SVM: support vector machine, GBDT: gradient boosted decision tree.
a. A total of 24 features selected by the respective ML algorithms from the original 36 features.
b. The set of the selected features demonstrated the highest accuracy within each category.

eTable 4.